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ON WEIGHT MATRIX AND FREE ENERGY MODELS FOR 
SEQUENCE MOTIF DETECTION* 

By Qing ZHOut 
University of California, Los Angeles 

The problem of motif detection can be formulated as the construc- 
, tion of a discriminant function to separate sequences of a specific 

pattern from background. In computational biology, motif detection 
is used to predict DNA binding sites of a transcription factor (TF), 
mostly based on the weight matrix (WM) model or the Gibbs free 
energy (FE) model. However, despite the wide applications, theoret- 
\^ \ ical analysis of these two models and their predictions is still lacking. 

We derive asymptotic error rates of prediction procedures based on 
these models under different data generation assumptions. This al- 
, , lows a theoretical comparison between the WM-based and the FE- 

■ based predictions in terms of asymptotic efficiency. Applications of 
' the theoretical results are demonstrated with empirical studies on 

ChlP-seq data and protein binding microarray data. We find that, 
irrespective of underlying data generation mechanisms, the FE ap- 
\ proach shows higher or comparable predictive power relative to the 

WM approach when the number of observed binding sites used for 
constructing a discriminant decision is not too small. 

Key words: asymptotic efficiency, discriminant analysis, protein- 
^SJ ■ DNA interaction, predictive error, transcription factor binding site. 
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I 1. Introduction. Transcription factors (TFs), a class of proteins, regulate gene 

■ transcription through their physical interactions with particular DNA sites. Such a 
DNA site is called a transcription factor binding site (TFBS), which is usually a short 

O \ piece of nucleotide sequence (e.g., 'CATTGTC'). Typically, a TF can bind different 

sites and regulate a set of genes. A key observation is that sites of the same TF share 
similarity in their sequence composition, which is characterized by a motif. Since gene 
regulation has always been an important problem in biology, many computational 
methods have been developed to predict whether a given DNA sequence can be bound 
. by a TF. Please see Elnitski et al. (2006), Ji and Wong (2006), and Vingron et al. 

(2009) for recent reviews on relevant methods. 

The prediction of TFBS's considered in this article is formulated as a classification 
problem. Denote by w the width of the binding sites and code the four nucleotide 
bases. A, C, G and T, by a set of positive integers X = {1, • • • , J} (J = 4). Suppose 
that we have observed a sample of labeled sequences of length w, Dn = {iXk, -^fc)}fc=i; 
where G 1-^ and G {0, 1} indicating whether is bound by the TF (1^ = 1) 
or not (Yfc = 0). We call Z)+ = {X^ : Y/c = 1} observed binding sites (or motif sites) 
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and D~ = {Xk : = 0} background sites (or background sequences). Then, motif 
detection is to construct a discriminant function from D,„ to predict the label of any 
new sequence x G I'^. 

Most of the existing computational methods for motif detection can be classified into 
two groups. The starting point of the first group is the sequence specificity of binding 
sites, which is often summarized by the position-specific weight matrix (WM). For early 
developments of WM, please see Stormo (2000). Under the WM model, each nucleotide 
(letter) in a binding site is assumed to be generated independently from a multinomial 
distribution on {A, C, G, T}. This model has been widely used in search of TFBS's 
(e.g.. Hertz and Stormo, 1999; Kel et al., 2003; Rahmann et al., 2003; Turatsinze 
et al., 2008), de novo motif finding (e.g., Stormo and Hartzell, 1989; Lawrence et al., 
1993; Bailey and Elkan, 1994; Roth et al., 1998; Liu et al., 2002) and many other works 
reviewed in Vingron et al. (2009). The second group aims at modeling physical binding 
affinity between a TF and its binding sites via the concept of the Gibbs free energy 
(FE) or binding energy (e.g.. Berg and von Hippel, 1987; Stormo and Fields, 1998; 
Gerland et al., 2002; Kinney et al., 2007). Assuming that each nucleotide in a DNA 
sequence of length w (^y-mer) contributes additively to the interaction with the TF, 
this approach often leads to a regression- type model for the conditional distribution of 
binding affinity given a piece of nucleotide sequence (e.g., Djordjevic et al. 2003; Foat 
et al. 2006). This group of methods have tight connections with predictive modeling 
approaches to gene regulation, reviewed in Bussemaker et al. (2007), which can be 
regarded as a natural generalization to the free energy framework (Zhou and Liu, 2008). 
Although the standpoints are different, the two groups of approaches are in some sense 
closely related. They often give similar discriminant functions for predicting TFBS's, 
and there are many FE-based methods that use a weight matrix to approximate Gibbs 
free energy (e.g., Granek and Clarke, 2005; Roider et al., 2007). 

In spite of the fast methodological development on the WM and the FE models, 
there is still a lack of solid theoretical analysis to compare model assumptions, pa- 
rameter estimations and response predictions of the two approaches. Such theoretical 
analysis can provide insights into these methods by seeking answers to a series of ques- 
tions. For example, what are the common and distinct assumptions between the WM 
and the FE models, what is the relative performance between the two approaches in 
predicting TFBS's given a certain data generation mechanism, and how to calculate 
their predictive error rates when the size of observed sample Z)„ becomes large? With- 
out answering these questions, one may find it difficult to understand the nature of 
these methods and cannot extract the full information contained in extensive empirical 
comparisons between the two approaches. 

In this article, we compare model assumptions and parameter estimations of typical 
WM and FE approaches, derive asymptotic error rates of their predictions under dif- 
ferent data generation models, and perform comparative studies on large-scale binding 
data. The article is organized as follows. In Section 2 we review the basic models of the 
two approaches. Asymptotic error rates of prediction procedures based on these mod- 
els are derived and analyzed in Section 3. Computational approaches are developed in 
Section 4 for practical applications of the theoretical results. Numerical analysis and 
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biological applications are presented in Sections 5 and 6, respectively, with a compar- 
ison of the WM-based and the FE-based predictions on ChlP-seq data and protein 
binding microarray data. The paper concludes with discussions in Section 7. Some 
mathematical details are provided in Appendices. Although presented in the specific 
context of motif detection, the results in this article are generally applicable to the 
modeling and classification of categorical data. 

2. Models. Let c be a scalar, u = {ui,--- ,uj) be a (column) vector, v = 
{vi,--- ,Vw) G I^, and A = {aij)wxj and B = {hij)wxj be two w x J matrices. 
For notational ease, we define c ± A := (c ± aij)wxj, A/B := {aij /bij)wxj provided 
that bij / 0, vA := YlT=i "m^, A{v) := HlLi and u{v) := Hili Uv,- Furthermore, 
we define I'l-fe] := (vi,-" ^Vk-ijVk+i, ■ " I'^w) and A[_j^j by removing the kth. row 

L P 

from A, for k = 1, • • • , Symbols '— t-' and '— are used for convergence in law and 
in probability, respectively. 

Let 9q = {9qi, • • • , 9qj) be the cell probabilities (probability vector) of a multinomial 
distribution for i.i.d. background nucleotides, where X]j=i ^Oj ~ ^ ^Oj > for 
j = 1, • • • , J. Since 6q can be accurately estimated from a large number of genomic 
background sequences, we assume that it is given in the following analyses. Throughout 
the paper, we assume that the cell probabilities of any multinomial distribution are 
bounded away from 0. 

2.1. The weight matrix model. Let X = {Xi,--- ,X^) G be a sequence of 
length w. In the weight matrix model (WMM), we assume that X is generated from a 
mixture distribution. Let Y £ {0, 1} label the mixture component. With probability qq, 
Y = and X is generated from an i.i.d. background model (with parameter) Oq, that 
is, P{X I y = 0) = 9q{X). With probability qi = 1 — qq, Y = 1 and X is generated 
from a weight matrix = {6ij)wxj = (^ir"" ^^wY, where 6i = {On,-- - ,Oij) is a 
probability vector for i = 1, ■ ■ ■ ,w and Xi is independent of other X^. {k ^ i). To be 
specific, P{X \Y=1) = @{X). From this model the log-odds ratio of Y given X is 

log = I I ^\ = log = log{q^/qo) + J2 log{9ixjeoxj. (1) 

P{Y = I A) qoOo[X) ^ 

In the WM-based prediction, qi is typically fixed by prior expectation or determined by 
the relative cost of the two types of errors (false positive vs false negative) . Effectively, 
we assume that qi is given. Let 

/3o = log(gi/(/o), ftj =log(%/S), (2) 
for 1 < i < w, 1 < J < J and (3 = {f3ij)wxj- We rewrite (1) as 

P(Y = 1 I X) ^ 
log p\y = q \ X) = ^0 + E =Po + Xl3 := h{X), (3) 

which defines an additive discriminant function to predict Y given X, i.e., to predict 
whether the sequence X can be bound by the TF. The label Y will be predicted as 
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1 if h{X) > and otherwise. This prediction can be regarded as a naive Bayesian 
classifier. 

Given observed binding sites , we estimate by the maximum likelihood estima- 
tor (MLE) 0™ = (0^", • • • , 0™)* and substitute it in equation (2) to obtain Here, 
the superscript 'm' stands for estimators based on the WMM. Let dO^ = 0™ — 9i, 
which is an infinitesimal in the order of l/i/H as n — t- oo. The standard asymptotic 
theory (e.g., Ferguson 1996) implies that 

J 

A AA(0, S[") restricted to ^ dd"^ = 0, as n ^ oo, (4) 



and that ^/ndOf^, i = 1, • • • ,111, are mutually independent. The (j, A;)th element of the 
covariance matrix is {Sj^Oij — 0ij9ik)/qi where 5jk is the Kronecker delta symbol 
and 1 < j,k < J. From equation (2) we have d/3^ = dOYj/Oij, which leads to the 
following limiting distribution, 

V^d$f}AM{0, (1 - for j = 1, • • • , J, as n ^ 00, (5) 

with ^/ndP'^ mutually independent for i = 1, - ■ ■ ,w. 

2.2. The free energy model. Let F, X = {Xi, ■ ■ ■ ,Xw) and FX be a TF, a DNA 
sequence, and the corresponding TF-DNA complex, respectively. The process of the 
TF-DNA interaction can be described by the chemical reaction F + X = FX. The 
concentrations of the three molecules at chemical equilibrium, [F], [X] and [-FX], are 
determined by the association constant Ka{X), that is, 

[F][X] ^ ' ^\ RT y 

where A.G{X) is the Gibbs free energy (FE) for the interaction of F with X, R'ls the 
gas constant and T the temperature. We regard RT > as a constant. Suppose that 
the contribution of a single nucleotide Xi to the FE is additive (von Hippel and Berg, 
1986; Benos et al., 2002) so that we may write -AG{X)/{RT) = c + Y,T=i hx, - Then 
we have 

\FX] ^ ^ 

log ^ = log[F] + C + hx, ■■=bo + Yl • (6) 

^ i=l i=l 

To avoid non-identifiability in estimation, we take Sref = (si; ■ " ' ? ^w) as a reference 
sequence to determine a baseline level of the FE, and define /3ij = bij — his^ for all i, j 
and /3o = &o + Y17=i ^isi^ such that /Sis- = for i = 1, • • • ,w and 

w w 

&o + ^&^x, =^o + 5]ftx, (7) 

i=l 1=1 
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for every X € I^. Let Y be the indicator for whether X is bound by the TF at 
chemical equihbrium. From the physical meaning of concentration, 

Combining equations (6), (7) and (8) leads to an additive discriminant function for 
this free energy model (FEM), 

P(Y = 1 I X) ^ 
log p\y = o \ X) " ^° ^ ^ =h + Xp := h{X). (9) 

Similarly as for the WMM, we assume that /3o is fixed by prior or a desired cost. 
Furthermore, it is conventional to assume that X is sampled from an i.i.d. background 
model Oq, i.e., P{X) = 9q{X). The data generation process of the FEM has a clear 
biological meaning. Suppose that we have sampled n nucleotide sequences of length 
w, {Xk G 2^"'}^=!, from the genomic background 6q. We mix these sequences with 
TF molecules in a container where the concentration of the TF is held as a constant. 
At chemical equilibrium we label the sequences X^ bound by the TF as Yfc = 1 
and otherwise Yfc = 0. The output of this experiment is the labeled sample Dn = 
{{Yk, Xk)}^^i- Although there exist other models based on binding free energy, we 
focus on this basic model in this paper, which makes a theoretical analysis relatively 
clean while capturing main characteristics of FE-based approaches. 

Given D^, the MLE of /3, denoted hy 0^ = ^ + d(3^ with the superscript '/' for 
FE-based estimators, can be calculated by the standard logistic regression. Note that 
maximizes the conditional likelihood 

P(y|X.«^*i^ (10, 

1 + exp(/^;o + X(3) 

determined by equation (9). Similar to the results in Efron (1975), it is not difficult 
to demonstrate that (3^ is consistent for j3 with asymptotic normality, 

as n — )• oo, (11) 

where dfif is regarded as a vector of (J — \)w dimensions (recall that /3is. = /3£. = 
for i = 1, • • • , w). The asymptotic covariance matrix 

= [^e, {vi{X)po{X)CxC'x]Y^ , (12) 

where Py{X) = P{Y = y \ X) for y = 0, 1, Cx is a (J — l)7i;-dimensional column 
vector coding each Xi as a factor of J levels, and E^p is taken with respect to (w.r.t.) 
the background model Oq that generates the sequence X. 
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2.3. Comparison. Given (/3o,/3) in the WMM and the reference sequence Sref in 
the FEM, if we let 

w 

^o = /3o + 5^A.,, fti = A,--/3i.,, (13) 
1=1 

for i = 1, - ■ ■ ,w, j = 1, - ■ ■ , J, then the two models have the same conditional distri- 
bution [Y I X] (3, 9) for any X. To simplify notations, we shall denote the decision 
function (9) in the FEM by h{X) = /3o + XP hereafter. Except for this condition 
distribution, other model assumptions are different. The WMM assumes that the nu- 
cleotides in X are generated independently given its label Y. But this is not true for 
the FEM, in which the conditional probability of X given Y is 

P{X I Y, FEM) oc P{Y I X, FEM)P(X | FEM) = ^^Eii^^±^^lXl6>o(X). (14) 

1 + exp(/3o + Xf3) 

Since equation (14) cannot be written as a product of functions of Xi, this model 
implicitly allows dependence among Xi, - ■ ■ , Xw Consequently, the FEM may account 
for some observed nucleotide dependences within a motif such as in Bulyk et al. (2002), 
Barash et al. (2003), Zhou and Liu (2004), and Zhao et al. (2005) among others. On 
the other hand, under the FEM model the marginal distribution of X is simply the 
background nucleotide distribution, i.e., P{X \ FEM) = 6q[X), but the marginal 
distribution of X under the WMM is a mixture, 

P{X I WMM) = qi&{X) + go^o(^)- (15) 

The different model assumptions lead to different procedures for parameter estima- 
tion, in particular the coefficients j3 {(3). As discussed in Sections 2.1 and 2.2, (3"^ and 
0-^ are consistent under the WMM and under the FEM, respectively. Since 0-^ maxi- 
mizes the conditional likelihood P{Y \ X,f3) (10) which is identical between the two 
models, it is also consistent for f3 under the WMM up to the translation (13). However, 
0^ is expected to be less efficient than 0"^ in prediction if the WMM corresponds to 
the underlying data generation process, due to the ignorance of the information on 
contained in the marginal likelihood P{X \ 0,WMM) (to be discussed in detail 
in Section 3.1). Conversely, if data are generated by the FEM, (3"^ is biased and no 
longer consistent. We will analyze the bias and the resulting incremental error rate in 
later sections. 

3. Theoretical results. For both WMM and FEM, the ideal decision function 
h{x) is obtained with the true parameters of the respective models and the corre- 
sponding ideal error rate 



Denote by 



R* = p(Y = 0, X = a;) + ^ P{Y = l,X = x). (16) 

x:h{x)>S) x:h(x)<0 



R*(x)= min P(Y = y \ X = x) 
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the ideal error rate for h{x) given X = x. Consider a decision function h{x) estimated 
from Dn- Given any x for which h{x)h{x) < 0, the incremental error rate beyond 
R*{x) is 

AR{x) = \P{Y = l\ X = x)- P{Y = Q\X = x)\. 
Then the expectation of the total incremental error rate for h is 



E[Ai?(/i)] = E 



AR{x)P{X = x)l ^h{x)h{x) < 0} 
= ^R{x)P{X = x)P ^h{x)h{x) < 0} , (17) 

where l(-) is the indicator function. Please note that /i, constructed from a sample 
of size n, is a random function. Let /\h{x) = h{x) — h{x) be the deviation of h{x) 
from h{x). In what follows, we will derive two theorems on E[Ai?(/i)] under different 
assumptions for /\h{x). As we will see, the asymptotic error rates of the WM and the 
FE procedures under the data generation models discussed in this paper can all be 
calculated based on the two theorems. 

Suppose that, for every x, ^/nAh{x) -^M{0, V{h, x)), where V{h, x) is the asymp- 
totic variance. As n — )• 00, 

E[AR{h)] Y ^R{x)P{X = x)P{Ah{x) > \h{x)\) 

= Y ^Rix)PiX = x)<^l-^Jnh^{x)/V{h,x)\, (18) 

where $ is the cdf of the standard normal distribution M{0, 1). Let 

a(h)= min h'^(x)/V(h,x), (19) 

h{x)j^O 

and X* be the corresponding minimum. Note that AR(x) = when h{x) = 0. Thus, 
as n 00, E[AR{h)] is dominated by the term AR{x*)P{X = x*) <P -{naih)}^/"^ 

where a{h) determines the rate of convergence. Using the theory of large deviations, 
we obtain: 

Theorem 1. If ^/nAh{x) J\f{'d,V{h,x)) for every x eZ^ then 

-logE[Ai?(/i)l ^ as 00. 
n 2 

Let h°' and h!' be two estimated decisions constructed from samples of size iia and n^, 
respectively. Suppose that both of them satisfy the condition in Theorem 1. We define 
the asymptotic relative efficiency (ARE) of h"" with respect to h!' by ARE(/i", N') = 
a^h"") / a{h^) , which is the limit ratio n^/na required to achieve the same asymptotic 
performance. 
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If h{x) is biased in the sense that ^/n{Ah{x) — fi{h,x)} —>M{0,V{h,x)), where 
li{h,x) denotes the asymptotic bias of h{x), then simple derivation from equation 
(17) gives that 



E[Aii(A)] ^ ^ AR{x)P{X = x)^ 



xex^ 



n sign{h{x)}{h{x) + fi{h,x)} 



\/v{h,x) 



as n — 7- oo, where sign(y) is the sign of y with sign(O) = 0. 

Theorem 2. Suppose that ^/n{Ah{x) - fi{h, x)} ^ M {0,V {h, x)) for every x G 
I"". Let B{h) = {x : sign{h{x)}{h{x) + fi{h, x)} < 0}. Then 

E[AR{h)] AR{x)P{X = x), asn^oo. (20) 

We ignore the case {x : h{x) + ^{h, x) = 0} which practically never happens. The 
set B{h) is the collection of x for which the estimated decision h gives a different 
predicted label from the ideal decision as n — >• oo. Note that E[Ai?(/i)] does not 
vanish if B{h) is nonempty. Thus, the incremental percentage over the ideal error rate, 
E[Ai?(/i)]/i?*, is an appropriate measure of the predictive performance of h. 

In the remainder of this section, we derive and compare the error rates of the WM 
and the FE procedures. From Sections 3.1 to 3.4, we assume that the constant term 
Po{/3o) is fixed to its true value. The results are generalized to situations where the 
constant is mis-specified in Section 3.5. The computation of a{h) (19) and E[Ai?(/i)] 
(20) will be discussed in Section 4. 

3.1. Error rates under WMM. In this subsection we assume that the underlying 
data generation process is given by the WMM. Since both /j™ and f3^ are consistent 
with asymptotic normality under the WMM, we may uniformly denote their deci- 
sion functions by h{x) = /3o + xf3 = h{x) + xdp, where d(3 = (3 — (3 and ^/ndp 
follows a normal distribution AA(0, S) as n — )• oo. This implies that ^/nAh{x) = 

^/nxd$ A-J\f{O,V'^{0,x)) with V'^{0,x) being the asymptotic variance. The super- 
script 'm' indicates the WMM as the data generation model. Let E[Ai?™(/3)] be the 
expected incremental error rate of h indexed by (3. Following Theorem 1, 

-logE[Ai?-(/3)]^-^^ = -- min -^^^t^, (21) 

as n — )• oo. Consequently, the ARE of the FE procedure w.r.t the WM procedure, 
ARE'"(/3^,/3™), is determined by the ratio of a™(/3^) over a'"(/3™). 

The decision function of the WM procedure is constructed with /3™ (Section 2.1). 
Note that xd/d = d^ix^ is a summation of w d/3ij's, each from a different d(3i. The 
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limiting distribution of ^/ndj3^j (5) and tlie mutual independence among imply 
that the asymptotic variance of y/nxd(3^ is 



1 



JMx, = -a:{(i-e)/e}, 



and consequently, 



a [p ) = mm 



»3/3^-/3o a;{(l-0)/0}- 

Suppose that we have chosen (si, • • • , s^) as the reference sequence in the FE pro- 
cedure. Define /3o and $ from the parameters (/3o)/3) of the WMM by equation (13). 
Then the FE-based estimator /3-^ is consistent for with asymptotic normality. Let 
d0^ = 0^ — l3. Similar to equation (12), the asymptotic covariance matrix of y/ndfSf 

is \K {^pi{X)pq{X)CxC^xY\ ' '^here the expectation is taken w.r.t. the marginal 
distribution of X under the WMM (15). Thus the covariance matrix can be written 
as 

MX) 



Gov" 



(22) 



where the expectation Mg^ averages over X G I'^ generated from the background 
model Oq- Based on equation (22), one can calculate the variance of ^/nxdi3^ for every 
X and determine the convergence rate a"^{0^) of the expected incremental error rate 
E[Ai?'"(/3^)] for the FE procedure. 

Because the estimation of 0-^ is only based on the conditional distribution [Y \ X] 
while 0"^ is estimated from the joint distribution of Y and X, we expect /3-^ to be 
less efficient in prediction with a"^{0^) < a'"(/3"*). We will conduct a numerical 
study in Section 5 to evaluate ARE'"(/3-^, /S™) on 200 transcription factors to confirm 
our conclusion. Here we demonstrate the lower efficiency of 0^ by the loss of Fisher 
information in estimating an individual 9ij from the conditional likelihood only. For 
simplicity, suppose that is given and collapse Xi into two categories, Xi = j and 

Xi ^ j. Because 

P{X,Y\ 0) = P{Y I X,@)P{X I 0) 

under the WMM, the loss of information equals the Fisher information on 9ij contained 
in the marginal likelihood P{X \ 0), denoted by I{6ij \ X). Let I{6ij \ X,Y) be the 
Fisher information on 6ij given X and Y jointly. We define 

A(% I [Y I X]) = I{e,, I X)/I{e^, I X, Y) (23) 

as the fraction of the loss of information on 6ij in the conditional likelihood PiY \ 
X,@). 

Proposition 3. Let 6ij = qoOoj + qiOij. If {Y, X) is drawn from the WMM then 
A(% I [Y I X]) > '^f f^ 7f := i?(gi,%,S)- 
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A proof of this proposition is given in Appendix A. If one chooses to include an equal 
number of background sites {Y = 0) and binding sites {Y = 1) in logistic regression to 
estimate 0-^ , which effectively specifies qo = qi = 0.5 by design, then this lower bound 
may be substantial. For example, with a uniform background distribution Oqj = 0.25 
for j = 1, • • • , 4, the range of B{qi,6ij,6oj) is between 20% and 55% for most typical 
values of 9ij (Table 1). 

Table 1 
Typical values of B{0.5, Oij , 0.25) 



Oij 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 

5(0.5, 6'.ij ,0.25)(%) 31 46 53 55 53 49 42 32 18 



3.2. WMM with Markov background. We generalize the background model to a 
Markov chain, which often represents a better fit to genomic background in high 
organisms. We assume that given Y = 0, X is generated by a first order Markov 
chain with a transition probability matrix i/>o = {tpoi^i y))jxJ where x,y £ I. For any 
X = (xi,--- ,Xw) ipo{x) ■= YIi=i'^o{xi-ijXi)j where ipo{xQ,xi) is interpreted as 

the probability of xi under the stationary distribution of the Markov chain. The ideal 
decision function under this model is 

pf'^ I ^\ ^ 

hi{x) = log ^ ^ Q I ^ = Po + ^^og9ia^^ -log'il)o{xi-i,Xi), (24) 

1=1 

where /3o = log(§i/c/o) and the subscript '1' [in hi{x) and Ai?™ (26)] indicates a quan- 
tity whose definition involves a Markov background model. Since i/jq can be accurately 
estimated with sufficient genomic background sequences, we assume that it is given. 
With the MLE 0™ from observed binding sites, the WM procedure constructs a de- 
cision whose expected incremental error rate converges to zero exponentially fast as 
n — 7- oo, following Theorem 1. 

With a slight abuse of notations, let us denote by Oq the probability vector of the 
stationary distribution of the Markov chain, which is also the marginal distribution 
of any nucleotide Xi in a background site. We still define /3q and (3 by equation (2) 
with Oq being the stationary probabilities, and translate /3o and (3 via a reference 
sequence to (3q and /3 (13). Let {Y,X) be a sample from the WMM with Markov 
background. If the dependence among neighboring nucleotides in a background site is 
ignored, the conditional likelihood P(Y \ X,(3), parameterized by (3, is then given by 

the same expression in equation (10). Because the FE-based estimator 0^ maximizes 

' f p ~ 

this conditional likelihood, it is standard to show that (3^ ^ (3 and is asymptotically 
normal. Let h^{x) = /3o + a;/3-^ denote the estimated decision function of the FE 
procedure. As ?i — )• oo, 

w 

hf{x)^^o + x0 = (3o + Y,^og Oi,^ - log 0ox. . (25) 

1=1 
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Let Ah^ (x) = hf {x) — hi{x) be the deviation of h-f (x) from the ideal decision (24). 
Comparing equations (24) and (25) gives the asymptotic bias, 

w 

b{x) = '^logipoixi-i,Xi) - log6'oxi = logV'o(^c) - Iog0o(a3)- 

1=1 

Due to the asymptotic normality of 0-^ , we have ^/n{Ay {x)-b{x)} 4 AA(0, V{^0f,x)), 
where Vf'ip^ , x) is the corresponding asymptotic variance. Under this model, 

AR{x)P{X = x) = \qi@{x) - qMx)\ = qoM^) \e'''^''^ - 

Following Theorem 2 with fi{h^,x) = b{x), the expected incremental error rate 

E[Ai27^(/3^)] ^ go e^i(^) - 1 iPo{x), as n ^ oo, (26) 

where Bi^{0^) = {x : sign{/ii(a;)}{/ii(a;) + b{x)} < 0}. The incremental percentage 
over the ideal error rate, 1K[AR'^{0^)]/{K^)* , is appropriate for comparing the FE- 
based prediction with the WM-based prediction whose expected error rate converges 
to {R^y . A general expression for R* is given in equation (16) which, under the WMM 
with Markov background, is written as 

[R^r = goIEv,„ {l(/ii(X) > 0) + e'^iWl(/ii(X) < 0)} . (27) 

3.3. Error rates under FEM. We now analyze asymptotic error rates of the two 
procedures regarding the FEM as the underlying data generation mechanism. 

The FE-based estimator 0^ is consistent for /3 under the FEM. The asymptotic nor- 
mality of (11, 12) implies that V^£cd/3-^ 47V(0, y^(/3^, a;)). Let E[Ai?^(/3^)] 
be the expected incremental error rate of the EE procedure under the FEM. From 
Theorem 1, we have 

-\og¥.[ARf{fif)] ^ = _1 min as n ^ oo. (28) 

Denote by 0^ = (0^^, • • • , O^j) the probability vector of the conditional distribution 
[Xi\Y = 1] under the FEM, i.e., 

0,^. = P(X, =j|y = l,FEM), (29) 

for i = 1, • • • and call 0-^ = {9{j)wxj the weight matrix. Recall that the WM- 
based estimator 0"^ is obtained by estimating individually from observed binding 
sites and then transforming the estimates via equation (2). Denote the estimated 

weight matrix by 0™. Since the data are generated by the FEM, 0™40^ and 
y/nd@^ = -^/n(0™ — 0-^) follows a multivariate normal distribution asymptotically. 
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similar to (4), but dO^^ and dO"^ may be correlated {1 < i ^ k < w). Given that the 
coefficients /3 in the FEM are defined w.r.t. a reference sequence, we transform 0™ to 

/3- = log{e^/eoj) - log(C,/^OsJ, for all (30) 

where Si is the ith. nucleotide of the reference sequence Sref- Let A/3™ = /3™ — (3 he 
the deviation of /3™ = {(3'^)wxj- To obtain its asymptotic distribution, we determine 
the cell probability (29) from equation (14), that is, 

where J/^ = /3o + X/3[_i] for X G X"'"!. In particular, 6^^ oc ^os.Eeo {e^V(l + e^')} 
since /^i^^ = 0. For i = 1, • • • ,w and j = 1, • • • , J, we define 

and rewrite 0^- = Ooje^'-^'^^''^ /Zi, where Zi = 0ojC is the normalizing constant. 

^ p f ~ 

Because 6^^ — 9-j and /Jj^. = 6is^ = 0, from equation (30) we have 

/3- ^ log(^,^./S) - ^Og{e{jeosJ = h + (32) 

for all i and j as n — t- go. Thus, 5 = {6ij)uixj is the asymptotic bias of /3™. From the 
asymptotic normality of ^/nd&"^, we see that ^/n{A(3"^ — S) follows a multivariate 
normal distribution with mean and finite covariance matrix as n — t- oo. Note that 
this multivariate normal distribution is defined on a (J — l)w;-dimensional space, since 
A/3- =J,,, = Ofori = l,--- 

Consider the WM-based decision function hJ^{x) = (Sq + xP"^ = h{x) + xA0"^. The 

above derivation shows that ^/n{xA0^ — xS) —>J\f{0, {0^, x)), where the variance 
is determined by the covariance matrix of A/3"^. Following Theorem 2, the expectation 
of the total incremental error rate of the WM procedure 

E[Ai?^(/3™)] ^ tanh|/i(ic)/2| 0o(a^), as oo, (33) 

where i3'^(/3"^) = {x : sign{/i(a;)}{/i(a;) + a;5} < 0}. Similarly, the incremental percent- 
age over the ideal error rate E[A/?^(/3™)]/(i?-^)* is used to compare the predictions of 
the WM and the FE procedures, given that E[Ai?^(/3-'')] (28). Under the FEM, 
the ideal error rate 

{R^y = Ee, {po{X)l{h{X) > 0)+pi{X)l{h{X) < 0)} . (34) 

RecaU that PyiX) = P{Y = y\X), i.e., 

^^(^) = + for y = 0,1. 



ON MOTIF DETECTION 



13 



3.4. FEM with Markov background. Next, we generalize the FEM to Markov back- 
ground and assume that any sequence X € X"^ is generated marginahy by a Markov 
chain. Consistent with Section 3.2, we denote by i/jq = {'^o{x,y))jxj the transition 
probability matrix of the Markov chain. 

It is trivial to see that, with the Markov background model, the ideal decision is 
still h{x) = /3o + xp. If v/{0^ ,x) denotes the asymptotic variance of y/nxdp^ under 
Markov background, then with in place of equation (28) remains valid for the 
FE-based prediction. On the other hand, if we proceed with the WM procedure, the 
expected incremental error rate 

E[Ai2((/3"')] ^ ^ tanh|/i(a;)/2| t/>o(a;), as n ^ oo, (35) 

where i3((/3"^) = {x : sign{/i(a;)}{/i(a;)+(5(a;)} < 0}. Here (5(a;) denotes the asymptotic 
bias of the WM-based decision function for x. A detailed derivation of equation (35) 
and the bias 5{x) is provided in Appendix B. In analogy to the FEM with i.i.d. back- 
ground, E[AR( /{r()* measures the increased error rate of the WM procedure 
relative to the FE procedure, where 

{R(r = E^, {po{X)l{h{X) > 0) +pi{X)l{h{X) < 0)} . 

3.5. Mis- specification of the constant term. In all the above derivations, we have 
assumed that the constant term PoiPo) is fixed to its true value. If this is not the case, 
then the deviation A/3o = /3o — /3o(/3o) will be an extra bias term for an estimated 
decision in which the constant term is fixed to /Sq. More specifically, the set B-l^ (/3™) 
in equation (33) will be replaced by {x : sign{h{x)}{h{x) + xS + A/3o} < 0}, and 
similarly for B'i'{pf) in equation (26) and B({0"^) in equation (35). 

4. Computation. To apply the theoretical results, we need to solve the mini- 
mization (19) and the summation (20) involved in Theorems 1 and 2, respectively. If 
the width of a motif w < 12, brute-force enumeration of all w-mers is computationally 
feasible, which provides exact solutions for both the minimization and the summation 
problems. 

For a motif of width w > 12, we minimize (19) to find a{h) by a two-step approach. 
We generate = 5 x 10^ w-mers from the background model Oq and identify the 
minimum of (19) among them. Then we refine the obtained minimum by simulated 
annealing for 5,000 iterations with temperature decreasing linearly from one to zero. 
At each iteration, we randomly choose one nucleotide Xi from the w positions and 
propose to mutate Xi to one of the other three nucleotide bases with equal probabil- 
ity. The proposal is accepted according to a Metropolis-Hastings ratio with current 
temperature. 

Since the set B{h), as in equations (26), (33) and (35), is usually small, it will be 
very inefficient to approximate the summation by generating w-mers from background 
distributions. Thus, we develop an importance sampling approach to approximate the 
summation (20) when w > 12. Here, we use the calculation of E[Ai?-^(/3"')] (33) to 
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illustrate this approach. Note that one can bound xS in the definition of B-l^ (/3™) 
so that xd G [Mi,M2], where Mi = '^^^^ miiij 5ij and M2 = ^^-^ maxj (5jj. These 
bounds imply that if x {0"^) then 

h{x) G (-M2, 0) U (0, -Ml) := Ji. 

We design a sequential proposal g{X) that is more likely to generate X with h[X) G 
T-i. Suppose that we have generated Xi, - ■ ■ (1 <k < w) from this proposal. Let 

hk-i = /3o + YnZiPiXi, in particular /iq = /So- We determine = min^ /3jj 

and -B^,^! = ^^^^^ maxj /3jj, the bounds for Yli>k l^iXr -^k = j then the range for 
h{X) is 

The larger the overlap between this interval and the more likely that X will belong 
to the desired set 6^0'^). Thus, we propose X^ with probability 

gk{Xk = j \ Xi,--- , Xk-i) oc I [Lkj - e, Ukj + e]n'H\, (36) 

where | • | returns the length of an interval and e is a small positive number to allow 
the generation of Xk = j when L^j = Ukj G T-l. Proposing X^ sequentially by (36) 
for k = 1, - ■ ■ ,w generates an X from g{X). With N proposed samples {X^^^}^^ we 
estimate the summation (33) by 



1 ^ 

-J^tanh /i(xW)/2 



t=i 



6>o(xW)l{xW G B^ip"^)} 

^(xW) • 



In this work, we propose = 5 x 10^ samples for this importance sampling estimation. 
We verified that the estimations were very close to the exact summations. With differ- 
ent bounds for hi{x) and h{x), this approach is applied to other similar summations 
in (26) and (35). 

5. Numerical study. A numerical study was performed under the WMM to 
confirm and quantify the lower predictive efficiency of the FE-based estimator /J-^ 
compared to the WM-based estimator /3™ discussed in Section 3.1. We randomly 
selected 200 TFs from the database TRANSFAC (Matys et al. 2003). For each TF, 
experimentally verified binding sites were used to construct a weight matrix with a 
small amount of pseudo counts. Then we randomly sampled 5,000 human upstream 
sequences, each of length 10 kilo bases, and calculated their nucleotide frequency 9q = 
(0.263, 0.234, 0.237, 0.266). The 200 weight matrices display large variability. The 
width w ranges from 6 to 21 and the information content, X^^i{2 + E.g^(log2 0iXi)}, 
ranges from 5.1 to 17.5 bits (Figure 1). These statistics show that our selection has 
covered the typical width and strength of DNA motifs. 

A constructed weight matrix was regarded as the parameter and the nucleotide 
frequency 6q was used for the i.i.d. background in the WMM. Since the prior odds ratio 
((71 /go) of a binding site over a background site is usually small, we chose three typical 
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Width Information Content (bit) 



Fig 1 . Histograms of the width and the information content of 200 WMs. 



Table 2 
Summary of ARFT {0-'^ , 0"") 



A 


Min. 


Qi 


Median 


Qs 


Max. 


200 
500 
1000 


0.134 
0.183 
0.217 


0.391 
0.475 
0.508 


0.490 
0.555 
0.560 


0.595 
0.638 
0.675 


0.849 
0.849 
0.918 



Qi,3: the first and the third quartiles. 



values for the inverse of the prior odds, A = Qo/qi = 200,500,1000, for numerical 
calculations. We evaluated the AREs of the FE-based prediction w.r.t. the WM-based 
prediction, defined by ARE'"(/3^, /3™) = a"'0f)/a"'{0"') in Section 3.1, for the 200 
WMs. As discussed in Section 4, our evaluation of AREs was exact for WMs of if < 12 
and was carried out with simulated annealing for w > 12. In addition, Monte Carlo 
average was utilized, before simulated annealing, to approximate CoY'^{^/nd0^^) (22) 
by simulating 5 x 10^ w-meics from the i.i.d. background. 

The asymptotic relative efficiencies ARE"'(/3^, /3'") on the 200 TPs are summarized 
in Table 2 for the three inverse prior odds. It is seen that for all the WMs the FE-based 
prediction shows lower efficiency than the WM-based prediction, and that the median 
AREs of 0-f to are between 50% and 60% and the third quartiles {Q3) between 
60% and 70%. Thus, for more than 75% of the TFs, the EE procedure is less than 
70% as efficient as the WM procedure in terms of prediction. This confirms the loss of 
efficiency of the FE-based prediction under the WMM, although both estimators are 
consistent. We note that the increase of ARE with higher A (smaller qi) is consistent 
with the lower bound defined in Proposition 3. 

6. Applications. In this section, we apply the WM and the EE approaches to 
ChlP-seq data and protein binding microarray (PBM) data. We perform cross valida- 
tion (CV) with training data of different size, ranging from 20 to 500 binding sites, 
for two purposes. First, with the large scale of both types of data, we can compare 
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empirical error rates in cross validation against theoretical error rates. This may allow 
us to verify some of the model assumptions and propose further improvement on the 
models. Second, we are also interested in examining the practical performance of the 
two computational methods when the number of observed binding sites varies in a 
wide range, which will provide useful guidance for future applications. 

6.1. ChlP-seq data. In the recent two years, the ChlP-seq technique (Johnson et 
al., 2007; Mikkelsen et al., 2007; Robertson et al., 2007) has become a powerful high- 
throughput method to detect TFBS's in whole genome scale. A binding peak in ChlP- 
seq data can usually narrow down the location of a TFBS to a neighborhood of 50 to 
200 bps (Johnson et al., 2007). ChlP-seq data that contain thousands of binding sites 
for a number of TFs have been generated in a study on mouse embryonic stem cells 
(Chen et al, 2008). We chose five TFs, Esrrb, Oct4, STAT3, Sox2 and cMyc, in this 
study to compare the WM and the FE methods. The five TFs all have well-defined 
weight matrices in literature and each contains more than 2,000 detected binding 
peaks in ChlP-seq, and their data quality was confirmed by motif enrichment analysis 
in Chen et al. (2008). To identify the exact binding site of a ChlP-seq binding peak, we 
searched the 200-bp neighborhood of the peak, 100 bps on each side, to find the best 
match to the known weight matrix of the TF. Given the very small search space, the 
uncertainty in the exact location of the binding site should be minimal. If the motif 
width of a TF is w, background w-mers were extracted from genomic control regions 
that match the locations of the binding sites relative to nearby genes. The ratio of the 
number of background sites over the number of binding sites was set to 200 for every 
TF, that is, the inverse prior odds ratio A = Qo/qi = 200. A transition matrix was 
estimated from the extracted background sites for each TF, since the log Bayes factor 
of a Markov background model over an i.i.d model was > 10^. 

Based on the way we composed the data sets, the WMM with Markov background 
(Section 3.2) seems a more plausible data generation model. Clearly, a data set was a 
mixture of detected binding sites and random background sites, and the background 
distribution was close to a Markov chain. If there is no within-motif dependence, 
binding sites can be regarded as being generated from a WM model, and consequently, 
the WM-based prediction is expected to have a smaller error rate compared to the FE- 
based prediction. However, if there exists within-motif dependence in binding sites, the 
FEM, which is able to capture such dependence, may outperform the WM approach 
regardless of the mixture nature of the data sets. We computed theoretical error rates 
of the two approaches under the WMM with Markov background. For each TF, we 
estimated a WM from all the binding sites and a transition matrix from the background 
sites. Regarding them as the model parameters, we calculated the asymptotic error 
rate of the WM-based prediction, which is the ideal error rate (27), and the incremental 
rate of the FE-based prediction (26). Note that the bias due to mis-specification of the 
constant term in the FE approach needs to be included for the calculation of equation 
(26). These theoretical error rates are reported in Table 3 (the column of n"*" = oo). 

To compare with theoretical results, we performed cross validation to compute em- 
pirical error rates of the WM and the FE procedures on each data set. We randomly 
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Table 3 



Predictive error rates (in 


the unit 


o/lO- 


-'■') for 


ChlP-seq 


data 




TF 


n+ 


20 


50 


100 


200 


500 


CO 




WM 


2.66 


2.49 


2.43 


2.40 


2.36 


2.30 


Esrrb 


EE 


4.30 


2.77 


2.54 


2.45 


2.37 


2.45 




(FE-WM)/WM (%) 


61.7 


11.2 


4.5 


2.1 


0.4 


6.5 




WM 


3.68 


3.54 


3.50 


3.47 


3.44 


2.98 


Oct4 


EE 


4.81 


3.71 


3.53 


3.45 


3.41 


3.06 




(FE-WM)/WM (%) 


30.7 


4.8 


0.9 


-0.6 


-0.9 


2.7 




WM 


3.03 


2.84 


2.78 


2.72 


2.69 


2.57 


STATS 


EE 


4.69 


3.09 


2.84 


2.75 


2.70 


2.74 




(FE-WM)/WM (%) 


54.8 


8.8 


2.2 


1.1 


0.3 


6.6 




WM 


2.95 


2.75 


2.68 


2.65 


2.63 


2.53 


Sox2 


EE 


3.44 


2.89 


2.73 


2.68 


2.66 


2.59 




(FE-WM)/WM (%) 


16.6 


5.1 


1.9 


1.1 


1.1 


2.4 




WM 


2.67 


2.49 


2.42 


2.38 


2.34 


2.07 


cMyc 


EE 


3.30 


2.51 


2.34 


2.26 


2.23 


2.24 




(FE-WM)/WM (%) 


23.6 


0.8 


-3.3 


-5.0 


-4.7 


8.2 



Note: Reported are average error rates over 100 CVs. 



sampled (without replacement) n"*" binding sites and A • n"*" background sites from 
a full data set to form a training set. Both approaches were applied to the training 
set to estimate their respective decision functions. For WM-based prediction, a WM 
and a transition matrix were estimated from the training data set to construct a de- 
cision function (24) with /3o = — log(A). For FE-based prediction, we applied logistic 
regression to the training set to obtain [x) = 1^0 + . Then we predicted the 
class labels of the remaining unused sequences (test set) by each of the two decision 
functions and calculated empirical error rates (CV error rates). This procedure was 
repeated 100 times independently for each value of to obtain the average CV error 
rate. To examine performance with a varying sample size (the number of sequences in 
a training set), we chose n+ from 20 to 500. 

The average CV error rates are reported in Table 3. The theoretical results give a 
reasonable approximation to the CV error rates for both approaches when the training 
sample size n+ > 200. The asymptotic error rates of the WM approach are uniformly 
lower than its CV error rates for all the TFs, while the FE approach achieves a smaller 
CV error rate with = 500 than its asymptotic rate for three TFs. Consequently, the 
incremental percentage of the FE-based prediction for = 500 is less than the ex- 
pected level calculated from the theory. This comparison implies that the WMM may 
not match the exact underlying data generation process, although it is more plausible 
than the FEM given the mixture composition of the data sets. As we discussed, poten- 
tial dependence within a motif may cause possible violation to the WMM. To verify 
our hypothesis, we conducted the x^-test for every pair of motif positions {Xi and X^, 
1 < i < k < w) given the binding sites in each data set. At the significance level of 
0.005, we identified 25, 19, 17, 8, and 12 pairwise correlations for Esrrb, Oct4, STAT3, 
Sox2, and cMyc binding sites, respectively, which gives a false discovery rate of < 2% 
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for all the TFs. By capturing such correlations the FEM is able to achieve compara- 
ble or even slightly better prediction than the WMM with a moderate-size training 
sample (n"*" > 100, Table 3). Finally, it is important to note that even under the exact 
model assumptions of the WMM, the FE-based prediction only results in a marginal 
increment in error rate (< 10%) compared to the WM approach asymptotically (Ta- 
ble 3, n"*" = oo). Together with the superior or comparable CV performance when the 
training size is reasonably large, this result suggests the use of the FE approach, when 
we have a sufficient number of observed binding sites. 

6.2. PBM data. Protein binding microarrays (Mukherjee et al., 2004) provide a 
high throughput means to interrogate protein binding specificity to DNA sequences. 
Quantitative measurement of the binding specificity of a protein to every short nu- 
cleotide sequence designed on a DNA microarray can be obtained simultaneously. The 
PBM data in Berger et al. (2008) quantified DNA binding of homeodomain proteins via 
the calculation of an enrichment score, with an expected false discovery rate (FDR), 
for each double-stranded nucleotide sequence of length eight (u; = 8) . The data set for 
each protein contains 32,896 8-mers, each with an enrichment score and an FDR. We 
identified as the consensus binding pattern for a protein the 8-mer with the highest 
enrichment score, and then labeled as binding sites those 8-mers whose FDR < 0.005 
and which differ by no more than three nucleotides from the consensus after consid- 
ering both the forward and the reverse complement strands. The remaining 8-mers 
were labeled as background sites and we randomly determined their strands (orien- 
tations) to avoid potential artifacts. In this study we included five proteins, Hoxall, 
Irx3, Lhx3, Nkx2.5, and Pou2f2, each from a different family, and called 134, 190, 267, 
145, and 213 binding sites, respectively. 

The FEM, developed by the biophysics of protein-DNA binding, is expected to 
be a better model that matches the design of PBM data than the WMM. Thus, 
theoretical analysis was conducted under the FEM for the five PBM data sets. We 
applied logistic regression to estimate (3 and /3o (9) with all the labeled 8-mers in a 
data set, where the 8-mer 'AAAAAAAA' was regarded as the reference sequence, i.e., 
Pii = 0. We calculated the ideal error rate [Rfy (34) of the FE-based prediction, 
with an i.i.d. uniform background (by design the background distribution is uniform). 
For the WM approach, we chose (3q as the log-ratio of the number of binding sites 
over that of background sites, and calculated its asymptotic error rate by equation 
(33), in which the bias in the constant term (A/3o) was included. The theoretical error 
rates are reported in Table 4 {n'^ = oo), where we find that the WM approach gives 
a significantly higher error rate, between 14% and 56%, than the FE approach. 

The same CV procedure as in the previous section was performed on the PBM data 
sets to compare the empirical predictive error rates of the two approaches, with 
varying between 20 and 100 (Table 4). There is a clear decreasing trend in error rate 
for both approaches with the increase of the training sample size n^, although for 
some data sets the difference between the CV error rate for n"*" = 100 and the asymp- 
totic rate is still quite obvious. Such discrepancy is probably due to the following two 
reasons. First, the parameters (/3, /3o) used for the calculation of asymptotic rates were 
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Table 4 

Predictive error rates (in the unit of 10~"^j for PBM data 


Protein 


n+ 20 50 100 oo 


Hoxall 


EE 3.57 2.62 2.43 1.63 
WM 3.51 3.22 3.05 2.10 
(WM-FE)/FE (%) -1.7 22.9 25.5 28.8 


Irx3 


EE 5.91 4.71 4.54 3.26 
WM 5.06 4.85 4.79 3.71 
(WM-FE)/FE (%) -14.4 3.0 5.5 13.8 


Lhx3 


EE 7.51 4.64 4.20 3.18 
WM 6.90 6.54 6.47 4.97 
(WM-FE)/FE (%) -8.1 40.9 54.0 56.3 


Nkx2.5 


EE 3.73 2.31 2.12 1.80 
WM 3.64 3.29 3.16 2.36 
(WM-FE)/FE (%) -2.4 42.4 49.1 31.1 


Pou2f2 


EE 6.25 4.91 4.56 3.31 
WM 5.79 5.57 5.49 4.02 
(WM-FE)/FE (%) -7.3 13.4 20.4 21.5 



estimated from data sets which only contain 100 to 200 binding sites. This resulted in 
a high variance in the estimated parameters: The median ratio of the standard error 
over the absolute value of an estimated coefficient was between 10% and 30% for the 
five data sets. Second, the training sample size, = 100, is still too small to achieve 
a comparable error rate as — )■ oo. However, we have already seen substantially in- 
creased error rates of the WM-based predictions compared to the FE-based predictions 
for = 100, which is very consistent with the theoretical results. This comparison 
confirms that unless the training sample size is really small, using the WM approach 
may degrade predictive performance dramatically if the data generation mechanism is 
close to the FEM. 

7. Discussion. Combining results on the ChlP-seq data and the PBM data, this 
study provides some general guidance for practical applications of the WM and the 
FE approaches, irrespective of underlying data generation. When the training sample 
size is small, the WM procedure seems to produce fewer errors than the FE procedure. 
But when we have observed enough binding sites, the advantage of the FE procedure 
is clearly seen. On one hand, it gives a comparable or slightly better prediction than 
the WM approach even if the WMM is more likely for the data (Table 3, > 100). 
On the other hand, when the data are generated in a way that matches the biophysical 
process of protein-DNA binding such as the PBM data, the reduction in error rate 
of the FE approach can be substantial compared to the WM approach (Table 4, 
> 50). The relative performance between the two approaches reflects a typical 
variance-bias tradeoff. Estimation under the WMM is simple and more robust, which 
typically has a smaller variance than the FEM. For a small sample size, predictive 
errors are mostly caused by variance in estimation and thus, WM-based predictions 
may outperform FE-based predictions. When the sample size increases, estimation 
variance decreases for both approaches and the potential bias in the WM approach 
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becomes the main factor for predictive errors. Given that its primary principle comes 
from the biophysics of protein-DNA interactions, the FEM has become more attractive, 
based on which many computational methods have been developed for predicting TF- 
DNA binding. In these methods a weight matrix is sometimes used as a first order 
approximation for computing free energy-based binding affinity. This work suggests 
that this approximation must be applied with caution. The results on the PBM data 
have demonstrated that the WM procedure may give a prediction with 50% or more 
errors compared to the FE-based decision for a reasonably large sample size (Table 4) . 

In recent years, a substantial amount of large-scale TF-DNA binding data have been 
generated for many important biological processes. As demonstrated by the applica- 
tions to ChlP-seq data and PBM data, large-sample theory is able to provide valuable 
insights on statistical estimation and prediction for such large-scale data. The results 
in this article can be regarded as a first step towards a theoretical development on 
computational approaches for gene regulation analysis. Incorporation of within-motif 
dependence in the WMM and interaction effects in the FEM is a direct next step 
of this work, for which the model selection component needs to be considered in a 
theoretical analysis. Although desired, further generalizations to methods for de novo 
motif discovery, identification of czs-regulatory modules and predictive modeling of 
gene regulation will be more challenging future directions. 

Appendices. 

Appendix A: Proof of Proposition 3. 

Proof. Let 0k(-j) = 1 ~ ^kj for k = 0,i. The second order partial derivative of the 
marginal log-likelihood l{& \ X) = log{qQOo{X) + qi&{X)} w.r.t. 6ij is 



where @{X) = • 9iXi for Xi = j, (— j) and similarly for 6q{X). Thus, the 

Fisher information on 9ij given X is 



d'^ljG I X) 

del 



gf{e[-.](^H])}' 



{qoOoiX) + qi&{X)] 



2 ' 




Because ]E0|_.j {0o(^[-t])/0[-i](-X^[-.i])} = !> Jensen's inequality implies that 
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The lower bound B{qi,9ij,0Qj) is obtained by dividing the R.H.S. of this inequahty 
by the Fisher information on 6ij given X and Y jointly, 

d^l{&\X,Y)\ qi 



/(%|X,y) = -E. 



where /(0 \ X,Y) = logP(X, F | 0) is the joint log-likelihood. □ 

Appendix B: Derivation 0/ E[Aii{(/3™)] (35). Given the estimated weight matrix 
0*" based on observed binding sites , the constructed decision function of the WM 
approach 

w 

hTix) = log(gi/go) + J^flogC, -logV'o(2;*-i,Xi)] 

i=l 

^ ^o + ±llo/Jf^-lo,^^^p-^],.sn^^ (37) 



i=l 



where Ojj = P{Xi = j \ Y = \) under the FEM with Markov background, (si, • • • , s^) 
is the reference sequence, and 

w 

^0 = \og{qi/qQ) + ^\og{elj'iljQ{si_i,Si)}. 
i=l 

Let a;f_j](s) = [xi,--- , Xj-i, s, Xj+i, • • • ,Xw)- Following a similar derivation in Sec- 



tion 3.3, we have Ofj oc exp(/3jj -|- r]ij), where 



Vij = log ' 



Yl ~, ~H V'oCaJHiO')) > - log < Yl 1 I u '^o{x[-i]{si)) 

^ ^ 1 -|- eP*J I I 1 -|- ^ ^ 



with = /3o + a;[_i]/3[„i]. Since ^i^^ = rjis^ = 0, iog{e{-/9{^,) = Pij + r]ij for all i and j. 
Thus, equation (37) becomes 

hT{x) ^Po + Y I + Vi.. - log M^l^llMj = h{x) + 6{x), 

^ I Vo(Si-l,Si) J 

where (5(£c) = ^^^^ 7?ia,^ - log{V'o(a;i-i, 2;i)/^o(si-i, Sj)}- Let A/if (a;) = hf{x)-h{x). 
The asymptotic normality of y/nd@^^ implies that -y/re{A/i^(a;) — 6{x)} follows a 
limiting normal distribution with mean and a finite (possibly zero) variance for 
every x. Equation (35) then follows from Theorem 2. 
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